ABSTRACT

An overview of the state of the art in psychological 
research is presented, with an emphasis on the attention given to 
effect sizes. The acceptance of small effect sizes for biomedical 
research is contrasted with the rejection of similar effect sizes for 
psychological research. The Binomial Effect Size Display is used to 
depict the practical magnitude of an effect size regardless of 
whether the dependent variable is dichotomous or continuous. Other 
topics discussed include: (1) the meaning of successful replication,
including successful replication of Type II errors; (2) reporting
results of replications, including tests of significance; (3)
meta-analytic procedures; (4) sampling bias; (5) overemphasis on
single values and disregard of details; (6) problems of heterogeneity 
of method and quality; (7) problems of independence of responses 
within a single study and within sets of studies; and (8) 
exaggeration of significance levels. Several benefits of 
meta-analysis are outlined. It is concluded that many findings of
psychological research are neither small nor practically unimportant.
Nevertheless, it is also concluded that in the areas of replication
and of the cumulation of research findings much remains to be done. 
Eight data tables and one graph are provided. A 46-item list of 
references is included. (TJH) 

My talk today is designed in part both to comfort the afflicted and to afflict the
comfortable. The afflicted are those of us who work in the softer, wilder areas of our
field: the areas where the results seem ephemeral and unreplicable, and where the
$r^2$'s seem always to be approaching zero as a limit. These softer, wilder areas include
those of social, personality, clinical, developmental, educational, organizational, and
health psychology. They also include parts of psychobiology and cognitive psychology.
These softer, wilder areas, however, may not include too much of
psychophysics.

My message to those of us toiling in these muddy vineyards will be that we are
doing better than we might have thought. My message to those of us in any areas in
which we feel we have pretty well nailed things down will be that we haven't, and
that we could be doing a whole lot better.
How Large Must an Effect Be To Be Important?

There is a bit of good news-bad news abroad in the land. The good news is that
more sophisticated editors, referees, and researchers are becoming aware that
reporting the results of a significance test is not a sufficiently enlightening procedure
to stand alone. More and more we are beginning to see a report of the magnitude of
the effect accompanying the p level. The bad news is that we are still not quite sure
what to do with such a report of the magnitude of the effect, for example, a
correlation coefficient.

There is one bit of training that all psychologists have undergone. From undergraduate
days onward we have all been taught that there is only one proper, decent
thing to do whenever we see a correlation coefficient: we must square it. For most of
the softer, wilder areas of psychology, squaring the correlation coefficient tends to
make it go away, vanish into nothingness as it were. That is one of the sources of
malaise in the social and behavioral sciences. It is sad and quite unnecessary, as we
shall soon see.

The Physicians' Aspirin Study

At a special meeting held on December 18, 1987, it was decided to end
prematurely a randomized double-blind experiment on the effects of aspirin on
reducing heart attacks (Steering Committee of the Physicians' Health Study
Research Group, 1988). The reason for this unusual termination of such an experiment
was that it had become so clear that aspirin prevented heart attacks (and
deaths from heart attacks) that it would be unethical to continue to give half the
physician research subjects a placebo. Now what do you suppose was the magnitude
of the experimental effect that was so dramatic as to call for the termination of this
research? Was $r^2$ .90, or .80, or .70, or .60, so that the corresponding r's would have
been .95, .89, .84, or .77? No. Well, was $r^2$ .50, .40, .30, or even .20, so that the
corresponding r's would have been .71, .63, .55, or .45? No. Actually, what $r^2$ was,
was .0011, with a corresponding r of .034.

Insert Table 1 about here 

Table 1 shows the results of the aspirin study in terms of raw counts, percentages,
and as a Binomial Effect Size Display (BESD). This display is a way of
showing the practical importance of any effect indexed by a correlation coefficient.
The correlation is shown to be the simple difference in outcome rates between the
experimental and the control groups in this standard table which always adds up to
column totals of 100 and row totals of 100 (Rosenthal & Rubin, 1982b).

This type of result seen in the physicians' aspirin study is not at all unusual in
biomedical research. Some years earlier, on October 29, 1981, the National Heart,
Lung, and Blood Institute discontinued its placebo-controlled study of propranolol
because results were so favorable to the treatment that it would be unethical to
continue withholding the life-saving drug from the control patients. And what was
the magnitude of this effect? Once again the effect size r was .04, and the leading
digits of the $r^2$ were .00! As behavioral researchers we are not used to thinking of r's
of .04 as reflecting effect sizes of practical importance. But when we think of an r of
.04 as reflecting a 4% decrease in heart attacks, the interpretation given r in a
Binomial Effect Size Display, the r does not appear to be quite so small, especially if
we can count ourselves among the 4 per 100 who manage to survive.

Insert Table 2 about here 

Additional Results

Table 2 gives three further examples of Binomial Effect Size Displays. In a 
recent study of 4,462 Army veterans of the Vietnam War era (1965-1971), the 
correlation between having served in Vietnam (rather than elsewhere) and having 
suffered from alcohol abuse or dependence was .07 (Centers for Disease Control,
1988). The top display of Table 2 shows that the difference between the problem 
rates of 53.5 and 46.5 per 100 is equal to the correlation coefficient of .07. 

The center display of Table 2 shows the results of a study of the effects of AZT
on the survival of 282 patients suffering from AIDS or AIDS-related complex (ARC)
(Barnes, 1986). This result of a correlation of .23 between survival and receiving
AZT (an $r^2$ of .054) was so dramatic as to lead to the premature termination of the
clinical trial on the ethical grounds that it would be improper to continue to give
placebo to the control group patients.

As a footnote to this display let me add the result of a small informal poll I took
a few weeks ago of some physicians spending the year at the Center for Advanced
Study in the Behavioral Sciences. I asked them to tell me of some medical breakthrough
that was of very great practical importance. Their consensus was that the
breakthrough was the effect of cyclosporine in increasing the probability that the
body would not reject an organ transplant and that the recipient patient would not
die. A multi-center randomized experiment was published in 1983 (Canadian
Multicentre Transplant Study Group, 1983). The results of this breakthrough
experiment were less dramatic than the results of the AZT study. For the dependent
variable of organ rejection the effect size r was .19 ($r^2$ = .036); for the dependent
variable of patient survival the effect size r was .15 ($r^2$ = .022).

The bottom display of Table 2 shows the results of a famous meta-analysis of
psychotherapy outcome studies reported by Smith and Glass (1977). An eminent
critic (Rimland, 1979) believed that the results of their analysis sounded the "death
knell" for psychotherapy because of the modest size of the effect. This modest effect
size was an r of .32 accounting for "only 10% of the variance."

Examination of the bottom display of Table 2 shows that it is not very realistic
to label as "modest indeed" an effect size equivalent to increasing a success rate from
34% to 66% (for example, reducing a death rate or a failure rate from 66% to 34%).
Indeed, as we have seen, the dramatic effects of AZT were substantially smaller (r =
.23), and the "breakthrough" effects of cyclosporine were smaller still (r = .19).
Telling How Well We're Doing

The Binomial Effect Size Display is a useful way to display the practical magnitude
of an effect size regardless of whether the dependent variable is dichotomous or
continuous (Rosenthal & Rubin, 1982b). An especially useful feature of the display
is how easily we can go from the display to an r (just take the difference between the
success rates of the experimental versus the control group) and how easily we can go
from an effect size r to the display (just compute the treatment success rate as .50
plus one-half of r and the control success rate as .50 minus one-half of r).
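The two conversions just described can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names `besd` and `r_from_besd` are hypothetical and do not come from the sources cited here.

```python
def besd(r):
    """Convert an effect size r into Binomial Effect Size Display success rates.

    Returns (treatment_rate, control_rate) as proportions:
    .50 plus one-half of r, and .50 minus one-half of r.
    """
    treatment_rate = 0.50 + r / 2
    control_rate = 0.50 - r / 2
    return treatment_rate, control_rate


def r_from_besd(treatment_rate, control_rate):
    """Recover r as the simple difference between the two success rates."""
    return treatment_rate - control_rate


if __name__ == "__main__":
    # Effect sizes quoted in the text for Table 2.
    examples = [
        ("Vietnam service and alcohol problems", 0.07),
        ("AZT and survival", 0.23),
        ("psychotherapy outcome", 0.32),
    ]
    for label, r in examples:
        treatment, control = besd(r)
        print(f"{label}: r = {r:.2f} -> {100 * treatment:.1f}% vs {100 * control:.1f}%")
```

For r = .32 the sketch gives rates of 66% and 34%, the figures discussed above for the psychotherapy meta-analysis.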

One effect of the standard use of a display procedure such as the Binomial
Effect Size Display to index the practical value of our research results would be to
give us more useful and more realistic assessments of how well we are really doing as
researchers in the social and behavioral sciences. Employment of the Binomial
Effect Size Display has, in fact, shown that we are doing considerably better in our
"softer, wilder" sciences than we may have thought we were doing.

So far, our conversation has been intended to comfort the afflicted. In what 
follows the intent is a bit more to afflict the comfortable. We begin with the topic of 
replication. 

The Meaning of Successful Replication 
There is a long tradition in psychology of our urging one another to replicate
each other's research. Indeed, there seems to be something nearly scriptural about
it. I quote: "If a scholar's work be deemed unreplicable then shall ye gladly cast that
scholar out." (That's from either Referees I or Editors II, I believe.)

Now, while we have been very good at calling for replications we have not been 
too good at deciding when a replication has been successful. The issue we now 
address is: When shall a study be deemed successfully replicated?

Successful replication is ordinarily taken to mean that a null hypothesis that
has been rejected at time 1 is rejected again, and with the same direction of outcome,
on the basis of a new study at time 2. The basic model of this usage can be seen in
Table 3. The results of the first study are described dichotomously as p < .05 or p >
.05 (or some other critical level, e.g., .01). Each of these two possible outcomes is
further dichotomized as to the results of the second study as p < .05 or p > .05. Thus,
cells A and D of Table 3 are examples of failure to replicate because one study was
significant and the other was not. Let us examine more closely a specific example of
such a "failure to replicate."

Insert Table 3 about here

Pseudo-Failures to Replicate 

The saga of Smith and Jones. Smith has published the results of an experiment
in which a certain treatment procedure was predicted to improve performance. She 
reported results significant at p<.05 in the predicted direction. Jones publishes a 
rebuttal to Smith claiming a failure to replicate. 

Insert Table 4 about here 

Table 4 shows the results of these two experiments in greater detail. Smith's
results were more significant than Jones's, to be sure, but the studies were in perfect
agreement as to their estimated sizes of effect as defined either by Cohen's d
[$(\mathrm{Mean}_1 - \mathrm{Mean}_2)/\sigma$] or by r, the correlation between group membership and performance
score (Cohen, 1977; 1988; Rosenthal, 1984). Not only did the effect sizes of the two
studies agree, but even the significance levels of .03 and .30 did not differ very
significantly: $(Z_{.03} - Z_{.30})/\sqrt{2} = (2.17 - 1.03)/\sqrt{2} = Z = .81$, p = .42; for details on
the comparison of significance levels and effect sizes see Rosenthal and Rubin (1979;
1982a) or a summary in Rosenthal (1984). Table 4 shows very clearly that Jones was
very much in error when he claimed that his study failed to replicate that of Smith.
Such errors are made very frequently in most areas of psychology and the other
behavioral and social sciences. The final column of Table 4 shows that the combined
result of both experiments is associated with a more significant t and with a smaller
confidence interval (for the difference between the means and for the effect size r)
than is either of the individual studies.
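The comparison of the two significance levels can be reproduced approximately with the Python standard library. This is a minimal sketch that assumes both p values are two-tailed and that both results are in the same direction; the function names are hypothetical. Because it uses unrounded normal deviates, it yields a Z close to, but not exactly, the rounded value reported above.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_from_two_tailed_p(p):
    """Standard normal deviate corresponding to a two-tailed p value."""
    return STANDARD_NORMAL.inv_cdf(1 - p / 2)


def compare_significance_levels(p1, p2):
    """Test whether two same-direction, two-tailed p values differ.

    Returns (z_difference, two_tailed_p_of_difference).
    """
    z1 = z_from_two_tailed_p(p1)
    z2 = z_from_two_tailed_p(p2)
    z_difference = (z1 - z2) / sqrt(2)
    p_difference = 2 * (1 - STANDARD_NORMAL.cdf(abs(z_difference)))
    return z_difference, p_difference


if __name__ == "__main__":
    z_diff, p_diff = compare_significance_levels(0.03, 0.30)
    print(f"Z of the difference = {z_diff:.2f}, two-tailed p = {p_diff:.2f}")
```

The point of the sketch is the one made in the text: a "significant" and a "nonsignificant" result need not differ significantly from each other.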

On the odds against replicating significant results. A related error often found
in the behavioral and social sciences is the implicit assumption that if an effect is
"real," we should therefore expect it to be found significant again upon replication.
Nothing could be further from the truth. 

Suppose there is in nature a real effect with a magnitude out there in the world
of d = .50 (i.e., $(\mathrm{Mean}_1 - \mathrm{Mean}_2)/\sigma$ = .50 $\sigma$ units), or, equivalently, r = .24 (a
difference in success rate of 62% versus 38%). Then suppose an investigator studies
this effect with an N of 64 subjects or so, giving the researcher a level of statistical
power of .50, a very common level of power for behavioral researchers of the last 30
years (Cohen, 1962; Sedlmeier & Gigerenzer, in press). Even though a d of .50 or an
r of .24 is a very important effect (as we saw earlier in this paper), there is only one
chance in four that both the original investigator and a replicator will get results
significant at the .05 level. If there were two replications of the original study there
would be only one chance in eight that all three studies would be significant, even
though we know the effect in nature is very real and very important.

If five studies investigated this phenomenon, there is only a 50:50 chance that
three or more of them would find significant results. In short, given the levels of 
statistical power at which we normally operate, we have no right to expect the 
proportion of significant results that we typically do expect, even if in nature there is 
a very real and very important effect. 
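The odds quoted above follow from simple binomial arithmetic once power is fixed at .50. The sketch below reproduces them, and also shows a rough normal approximation to the power figure itself. The approximation and the split of 64 subjects into two groups of 32 are assumptions made for illustration, not details given in the text; the function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist


def approx_power_two_groups(d, n_per_group, alpha=0.05):
    """Normal approximation to the power of a two-tailed, two-group test.

    Assumes equal group sizes and treats the test statistic as normal.
    """
    normal = NormalDist()
    z_critical = normal.inv_cdf(1 - alpha / 2)
    noncentrality = d * sqrt(n_per_group / 2)
    upper_tail = 1 - normal.cdf(z_critical - noncentrality)
    lower_tail = normal.cdf(-z_critical - noncentrality)
    return upper_tail + lower_tail


def prob_at_least(k, n, power):
    """Probability that at least k of n independent studies are significant."""
    return sum(
        comb(n, i) * power**i * (1 - power) ** (n - i) for i in range(k, n + 1)
    )


if __name__ == "__main__":
    # Assumed design: 64 subjects split into two groups of 32.
    print(f"approximate power: {approx_power_two_groups(0.50, 32):.2f}")
    power = 0.50
    print(f"both of 2 significant:    {prob_at_least(2, 2, power):.3f}")
    print(f"all 3 of 3 significant:   {prob_at_least(3, 3, power):.3f}")
    print(f"3 or more of 5 significant: {prob_at_least(3, 5, power):.3f}")
```

With power of .50 the three probabilities are one in four, one in eight, and one in two, matching the figures in the text.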
Pseudo-Successful Replications 

Returning now to Table 3, we focus attention on cell B, the cell of "successful 
replication." Suppose that two investigators both rejected the null hypothesis at 
p < .05 with both results in the same direction. Suppose further, however, that in one 
study the effect size r was .90 while in the other study the effect size r was only .10,
significantly smaller than the r of .90 (Rosenthal & Rubin, 1982a). In this case our
interpretation is more complex. We have indeed had a successful replication of the
rejection of the null hypothesis but we have not come even close to a successful
replication of the effect size.
"Successful Replication" of Type II Error

Cell C of Table 3 represents the situation in which both studies failed to reject 
the null hypothesis. Under those conditions investigators might conclude that there 
was no relationship between the variables investigated. Such a conclusion could be
very much in error, the more so as the power of the two studies was low (Cohen, 1977;
1988; Rosenthal, 1986). If power levels of the two studies (assuming medium effect 
sizes in the population) were very high, say .90 or .95, then two failures to obtain a 
significant relationship would provide evidence that the effect investigated was not 
likely to be a very large effect. If power calculations had been made assuming a very 
small effect size, two failures to reject the null while not providing strong evidence 
for the null would at least suggest that the size of the effect in the population was 
probably quite modest. 

If sample sizes of the two studies failing to reject the null were small so that
power to detect all but the largest effects was low, very little could be concluded from
two failures to reject except that the effect sizes were unlikely to be enormous. For
example, two investigators with N's of 20 and 40, respectively, find results not
significant at p < .05. The effect sizes phi (i.e., r for dichotomous variables) were .29
and .20, respectively, and both p's were approximately .20. The combined p of these
two results, however, is .035 [$(Z_1 + Z_2)/\sqrt{2} = Z$], and the mean effect size in the
mid-.20's is not trivial, as we saw earlier in this paper.
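The combined p in this example can be checked with a short sketch of the sum-of-Z's procedure, often called the Stouffer method. It assumes the two p's of about .20 are two-tailed, that both results are in the same direction, and that the combined p is reported one-tailed; the function name `combine_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def combine_z(z_values):
    """Combine independent standard normal deviates by summing them and
    dividing by the square root of their number.

    Returns (combined_z, one_tailed_p).
    """
    combined_z = sum(z_values) / sqrt(len(z_values))
    one_tailed_p = 1 - STANDARD_NORMAL.cdf(combined_z)
    return combined_z, one_tailed_p


if __name__ == "__main__":
    # Each study: two-tailed p of about .20, same direction (assumed).
    z_each = STANDARD_NORMAL.inv_cdf(1 - 0.20 / 2)
    combined_z, combined_p = combine_z([z_each, z_each])
    print(f"Z per study = {z_each:.2f}")
    print(f"combined Z = {combined_z:.2f}, one-tailed p = {combined_p:.3f}")
```

Two results that each fall short of significance can thus combine into a result that does not.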

Contrasting Views of Replication

The traditional, not very useful view of replication modeled in Table 3 has two 
primary characteristics: 

(1) It focuses on significance level as the relevant summary statistic of a study, and

(2) It makes its evaluation of whether replication has been successful in a 
dichotomous fashion. For example, replications are successful if both or neither 
p<.05 (or .01, etc.), and they are unsuccessful if one p< .05 (or .01, etc.) and the other 
p>.05 (or .01, etc.). Psychologists' reliance on a dichotomous decision procedure 
accompanied by an untenable discontinuity of credibility in results varying in p 
levels has been well documented (Nelson, Rosenthal, & Rosnow, 1986; Rosenthal & 
Gaito, 1963, 1964). 

The newer, more useful views of replication success have two primary
characteristics:

1. A focus on effect size as the more important summary statistic of a study
with only a relatively minor interest in the statistical significance level, and

2. An evaluation of whether replication has been successful made in a continuous
fashion. For example, two studies are not said to be successful or unsuccessful
replicates of each other, but rather the degree of failure to replicate is specified.

Insert Table 5 about here 

Table 5 shows three sets of replications. Replication set A shows two results 
both rejecting the null but with a difference in effect sizes of .30 in units of r or .35 in 
units of Fisher's Z transformation of r (Cohen, 1977, 1988; Rosenthal & Rosnow, 
1984; Snedecor & Cochran, 1980). That difference, in units of r or Fisher's Z is the 
degree of failure to replicate. The fact that both studies were able to reject the null
and at exactly the same p level is simply a function of sample size. Replication set B 
shows two studies with different p values, one significant at <.05, the other not 
significant. However, the two effect size estimates are in excellent agreement. We 
would say, accordingly, that replication set B shows more successful replication than 
does replication set A. Replication set C shows two studies differing markedly in
both level of significance and magnitude (and direction) of effect size. Replication set
C, then, is a not very subtle example of a clear failure to replicate. 

It should be noted that the values of Table 5 were chosen so that the combined
probability of the two studies of sets A, B, and C would all be identical to one
another: $(Z_1 + Z_2)/\sqrt{2} = Z$ of 2.77, p = .0028, one-tailed.
Some Metrics of the Success of Replication 

Once we adopt a view of the success of replication as a function of similarity of
effect sizes obtained, we can become more precise in our assessments of the success of
replication. 

Insert Figure 1 about here 

The replication diagonal. Figure 1 shows the "replication plane" generated by
crossing the results of the first study conducted (expressed in units of the effect size r)
by the results of the second study conducted. All perfect replications, those in which
the effect sizes are identical in the two studies, fall on a diagonal rising from the
lower left corner (-1.00, -1.00) to the upper right corner (+1.00, +1.00). The results
of replication set B from Table 5 are shown to fall exactly on the diagonal of perfect
replication (+.26, +.26). The results of replication set A are shown to fall somewhat
above the line representing perfect replication. Figure 1 shows that although set B
reflects more successful replication than set A, the latter is also located fairly close to
the line and is, therefore, a fairly successful replication set as well. The results of
replication set C, however, are shown to fall rather far from the diagonal of perfect
replication.

Cohen's q. An alternative to the indexing of the success of replication by the 
difference between obtained effect size r's is to transform the r's to Fisher's Z's before 
taking the difference. Fisher's Z metric is distributed nearly normally and can thus 
be used in setting confidence intervals and testing hypotheses about r's, whereas r's 
distribution is skewed and the more so as the population value of r moves further 
from zero. Cohen's q, the difference between the two Fisher Z's, is especially useful
for testing the significance of difference
between two obtained effect size r's. This is accomplished by means of the fact that

$$\frac{Z_{r_1} - Z_{r_2}}{\sqrt{\dfrac{1}{N_1 - 3} + \dfrac{1}{N_2 - 3}}}$$

is distributed as Z, the standard normal deviate (Rosenthal, 1984; Rosenthal &
Rubin, 1982a; Snedecor & Cochran, 1980). When there are more than two effect size
r's to be evaluated for their variability (i.e., heterogeneity) we can simply compute
the standard deviation (S) among the r's or their Fisher Z equivalents. If a test of
significance of heterogeneity of these Fisher Z's is desired, the three references above
all provide the appropriate formula for computing the test of heterogeneity as
does Hedges (1982).

Meta-analytic metrics. As the number of replications for a given research
question grows, a full assessment of the success of the replicational effort requires
the application of meta-analytic procedures. An informative but slightly unwieldy
summary of the meta-analysis might be the stem-and-leaf display of the effect sizes
found in the meta-analysis (Tukey, 1977). A more compact summary of the effect
sizes might be Tukey's (1977) box plot, which gives the highest and lowest obtained
effect sizes along with those found at the 25th, 50th, and 75th percentiles. For single
index values of the consistency of the effect sizes, one could employ (a) the range of
effect sizes found between the 75th ($Q_3$) and 25th ($Q_1$) percentile, (b) some standard
fraction of that range (e.g., half or three-quarters), (c) S, the standard deviation of
the effect sizes, or (d) SE, the standard error of the effect sizes.

As a slightly more complex index of the stability, replicability, or clarity of the
average effect size found in the set of replicates, one could employ the mean effect
size divided either by its standard error ($S/\sqrt{k}$, where k is the total number of
replicates), or simply by S. The latter index of mean effect size divided by its
standard deviation (S) is the reciprocal of the coefficient of variation or a kind of
coefficient of robustness.

The coefficient of robustness of replication. Although the standard error of the 
mean effect size along with confidence intervals placed around the mean effect size 
are of great value (Rosenthal & Rubin, 1978), it will sometimes be useful to employ a 
robustness coefficient that does not increase simply as a function of the increasing 
number of replications. Thus, if we want to compare two research areas for their
robustness, adjusting for the difference in number of replications in each research
area, we may prefer the robustness coefficient defined as the reciprocal of the 
coefficient of variation. 

The utility of this coefficient is based on two ideas: first, that replication
success, clarity, or robustness depends on the homogeneity of the obtained effect size;
and second, that it depends also on the unambiguity or clarity of the directionality of
the result. Thus, a set of replications grows in robustness as the variance of the
effect sizes decreases and as the distance of the mean effect size from zero increases.
Incidentally, the mean may be weighted, unweighted, or trimmed (Tukey, 1977).
Indeed, it need not be the mean at all but any measure of location or central
tendency (e.g., the median).

Insert Tables 6 & 7 about here

Table 6 has been prepared to give some feel for the practical meaning of several
degrees of variability (S) for seven sets of five replicates each, assuming a mean
effect size of zero. For our effect size indicator we have employed the Fisher Z
transformation of the correlation coefficient r. When the range of the five $Z_r$'s is only from
-.02 to +.02, S = .016; when the range is from -1.00 to +1.00, S = .791. Table 7
shows the replication robustness coefficients for each of the seven degrees of
variability (S) for each of four levels of mean effect size ($Z_r$): .10, .30, .50, and .70.

There are no intrinsic meanings to any particular robustness coefficients. 
Instead, they are intended to be used to compare different research domains for their 
replicational robustness in a merely heuristic way. 
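A short sketch may make the robustness coefficient concrete. It assumes that the five replicates in each set are evenly spaced across the stated range and that S is computed with k - 1 in the denominator; both are assumptions made for this illustration, although they do reproduce the two values of S quoted above. The function name `robustness` is hypothetical.

```python
from statistics import mean, stdev


def robustness(effect_sizes):
    """Coefficient of robustness: mean effect size divided by S,
    i.e., the reciprocal of the coefficient of variation."""
    return mean(effect_sizes) / stdev(effect_sizes)


if __name__ == "__main__":
    # Five evenly spaced Fisher Z effect sizes centered on zero (assumed spacing).
    narrow = [-0.02, -0.01, 0.00, 0.01, 0.02]
    wide = [-1.00, -0.50, 0.00, 0.50, 1.00]
    print(round(stdev(narrow), 3))  # 0.016
    print(round(stdev(wide), 3))    # 0.791

    # Shift each set so that its mean effect size is .10 rather than zero.
    for label, replicates in [("narrow", narrow), ("wide", wide)]:
        shifted = [z + 0.10 for z in replicates]
        print(f"{label}: robustness at mean .10 = {robustness(shifted):.2f}")
```

Because the coefficient divides by S rather than by the standard error, adding more replicates with the same spread leaves it essentially unchanged, which is the property the text asks of it.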
What Should Be Reported? 

If we are to take seriously our newer view of the meaning of the success of 
replications, what should be reported by authors of papers seen to be replications of 
earlier studies? Clearly, reporting the results of tests of significance will not be 
sufficient. The effect size of the replication and of the original study must be
reported. It is not crucial which particular effect size is employed, but the same 
effect size should be reported for the replication and the original study. Complete 
discussions of various effect sizes and when they are useful are available from Cohen 
(1977, 1988) and elsewhere (e.g., Rosenthal, 1984). If the original study and its 
replication are reported in different effect size units these can usually be translated 
to one another (Cohen, 1977, 1988; Rosenthal, 1984; Rosenthal & Rosnow, 1984; 
Rosenthal & Rubin, in press). 

Especially if the results of either the original study or its replication were not
significant, the statistical power at which the test of significance was made (assuming,
for example, a population effect size equivalent to the effect size actually
obtained) should be reported (Cohen, 1988). In addition to reporting the statistical
power for each study separately, it would be valuable to report the overall
probability that both studies would have yielded significant results given, for
example, the effect size estimated from the results of the original and the replication
study combined.

As an illustration of this procedure, consider the data of Table 4. Employing
Cohen's power tables tells us that given an effect size of d = .50, Smith's power to
reject at p ≤ .05, two-tailed, was .60 while Jones's power was .18. Table 8 shows that
given these two levels of power there were only 11 chances in a hundred that both
studies would reject the null hypothesis given the effect size d = .50. Indeed, the odds
were three times greater (p = .33) that neither study would reject the null hypothesis
than that both would reject!
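The joint probabilities in this illustration are products of the two power values. The sketch below works them out, assuming the two studies are independent; the function name `joint_outcomes` and the dictionary keys are hypothetical labels chosen for the example.

```python
def joint_outcomes(power_first, power_second):
    """Joint probabilities of rejecting the null in two independent studies."""
    return {
        "both_reject": power_first * power_second,
        "only_first_rejects": power_first * (1 - power_second),
        "only_second_rejects": (1 - power_first) * power_second,
        "neither_rejects": (1 - power_first) * (1 - power_second),
    }


if __name__ == "__main__":
    # Power values quoted in the text for Smith (.60) and Jones (.18).
    outcomes = joint_outcomes(0.60, 0.18)
    for name, probability in outcomes.items():
        print(f"{name}: {probability:.2f}")
    ratio = outcomes["neither_rejects"] / outcomes["both_reject"]
    print(f"neither vs. both: {ratio:.1f} times as likely")
```

The four probabilities sum to 1, and the ratio of "neither" to "both" is about three, as stated above.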

Insert Table 8 about here 

Such results are not at all unusual. It has often been documented that behavioral
researchers are far fonder of making type II errors than of making type I errors
(Cohen, 1962, 1988; Rosenthal & Rosnow, in press; Rosenthal & Rubin, 1985;
Sedlmeier & Gigerenzer, in press). It has been suggested that it is part of our
Judeo-Christian-Shinto tradition that we be deeply troubled that somewhere out there
someone might be having a good time, could be getting a free ride, a significant
result they don't deserve, an .05 asterisk that was actually intended for someone
else.

A marvelous suggestion has been made by Donald Rubin that would go a long
way toward helping us get over our problem with the relative risks of type II versus
type I errors. Don has suggested that whenever we conclude that there is "no effect"
we report the effect size along with that confidence interval around the effect size
that ranges from the effect size of zero to the equally likely effect size greater than
the one we obtained.

To return to Table 4, the "failure to replicate" by Jones provides a good 
example. Jones did not reject the null but obtained an effect size of d = .50. If Jones
had been required to report that his d of .50 was just as close to a d of 1.00 as it was to
a d of zero, Jones would have been less likely to draw his wrong conclusion. 

Meta-Analytic Procedures: An Evaluation

Of course it was bound to happen. No discussion of replication and of the
evaluation of the success of a particular replication could long avoid a more formal 
consideration of meta-analytic procedures. 

In the years 1980, 1981, and 1982 alone, well over 300 papers were published 
on the topic of meta-analysis (Lamb and Whitla, 1983). Does this represent a giant
stride forward in the development of the behavioral and social sciences or does it 
signal a lemming-like flight to disaster? Judging from reactions to past meta- 
analytic enterprises, there are at least some who take the more pessimistic view. 
Some three dozen scholars were invited to respond to a meta-analysis of studies of 
interpersonal expectancy effects conducted by Don Rubin and myself (Rosenthal & 
Rubin, 1978). Although much of the commentary dealt with the substantive topic of
interpersonal expectancy effects, a good deal of it dealt with methodological aspects
of meta-analytic procedures and products. Some of the criticisms offered were 
accurately anticipated by Glass (1978) who had earlier received commentary on his 
meta-analytic work (Glass, 1976) and that of his colleagues (Smith & Glass, 1977; 
Glass, McGaw, & Smith, 1981). In the present discussion, the criticisms of our 
commentators are grouped into several conceptual categories, described, and
discussed. 

Sampling Bias and the File Drawer Problem 
This criticism holds that there is a retrievability bias such that studies
retrieved do not reflect the population of studies conducted. One version of this
criticism is that the probability of publication is increased by the statistical significance
of the results so that published studies may not be representative of the studies
conducted. This is a well-taken criticism, though it applies equally to more 
traditional narrative reviews of the literature. Procedures that can be employed to 
address this problem have been described elsewhere (Rosenthal, 1979a; 1984, 
Chapter 5; Rosenthal & Rubin, 1988). 

Loss of Information

Overemphasis on Single Values 

The first of two criticisms relevant to information loss notes the danger of 
trying to summarize a research domain by a single value such as a mean effect size. 
This criticism holds that defining a relationship in nature by a single value leads to 
overlooking moderator variables. When meta-analysis is seen as including not only 
combining effect sizes (and significance levels) but also comparing effect sizes in both 
diffuse and, especially, in focused fashion, the force of this criticism is removed
(Rosenthal, 1984, Chapter 4). 

Overlooking negative instances. A special case of the criticism under discussion 
is that, by emphasizing average values, negative cases are overlooked. There are 
several ways in which negative cases can be defined; e.g., p>.05, r = 0, r negative, r 
significantly negative, and so on. However we may define negative cases, when we 
divide the sample of studies into negative and positive cases we have merely 
dichotomized an underlying continuum of effect sizes or significance levels, and 
accounting for negative cases is simply a special case of finding moderator variables. 

Glossing over Details 

Although it is accurate to say that meta-analyses gloss over details, it is 
equally accurate to say that traditional narrative reviews do so, and that data 
analysts do so in every study in which any statistics are computed. The act of 
summarizing requires us to gloss over details. If we describe a nearly normal 
distribution of scores by the mean and σ we have nearly described the distribution 
perfectly. If the distribution is quadrimodal, the mean and σ will not do a good job 
of summarizing the data. It is the data analyst's job in the individual study, and the 
meta-analyst's job in meta-analysis, to "gloss well." Providing the reader with all the 
raw data of all the studies summarized avoids this criticism but serves no useful review 
function. Providing the reader with a stem-and-leaf display of the effect sizes 
obtained, along with the results of the diffuse and focused comparisons of effect sizes, 
does some glossing, but it does a lot of informing besides. 

There is, of course, nothing to prevent the meta-analyst from reading each 
study as carefully and assessing it as creatively as might be done by a more 
traditional reviewer of a literature. Indeed, we have something of an operational 
check on reading articles carefully in the case of meta-analysis. If we do not read the 
results carefully, we cannot obtain effect sizes and significance levels. In traditional 
reviews, results may have been read carefully or not read at all, with the abstract or 
the discussion section providing "the results" to the more traditional reviewer. 

Problems of Heterogeneity 

Heterogeneity of Method 

The first of two criticisms relevant to problems of heterogeneity notes that 
meta-analyses average over studies in which the independent variables, the 
dependent variables, and the sampling units are not uniform. How can we speak of 
interpersonal expectancy effects, meta-analytically, when some of the independent 
variables are operationalized by (a) telling experimenters that tasks are easy versus 
hard; or by (b) telling experimenters that subjects are good versus poor task 
performers? How can we speak, meta-analytically, of these expectancy effects when 
sometimes the dependent variables are reaction times, sometimes IQ test scores, and 
sometimes responses to inkblots? How can we speak of these effects when sometimes 
the sampling units are rats, sometimes college sophomores, sometimes patients, 
sometimes pupils? Are these not all vastly different phenomena? How can they be 
pooled together in a single meta-analysis? 

Glass (1978) has eloquently addressed this issue, the apples and oranges issue. 
They are good things to mix, he wrote, when we are trying to generalize to fruit. 
Indeed, if we are willing to generalize over subjects within studies, why should we 
not be willing to generalize over studies? If subjects behave very differently within 
studies we block on subject characteristics to help us understand why. If studies 
yield very different results from each other, we block on study characteristics to help 
us understand why. It is very useful to be able to make general statements about 
fruit. If, in addition, it is also useful to make general statements about apples, about 
oranges, and about the differences between them, there is nothing in meta-analytic 
procedures to prevent us from doing so. 

Heterogeneity of Quality 

One of the most frequent criticisms of meta-analyses is that "bad" studies are 
thrown in with good. This criticism must be broken down into two questions: (a) 
What is a "bad" study?, and (b) What shall we do about "bad" studies? 

Defining "bad" studies. Too often, deciding what is a "bad" study is a procedure 
richly susceptible to bias or to claims of bias (Fiske, 1978). "Bad" studies are too often 
those whose results we don't like, or, as Glass, McGaw, and Smith (1981) have put it, 
the studies of our "enemies." Therefore, when reviewers of research tell us they have 
omitted the "bad" studies, we should satisfy ourselves that this has been done by 
criteria we find acceptable. A discussion of these criteria (and the computation of 
their reliability) can be found elsewhere (Rosenthal, 1984, Chapter 3). 

Dealing with "bad" studies. The distribution of studies on a dimension of 
quality is, of course, not really dichotomous (good versus bad), but continuous with 
all possible degrees of quality. The fundamental method of coping with "bad" studies 
or, more accurately, variations in the quality of research, is by differential weighting 
of studies. Dropping studies is merely the special case of zero weighting. 

The most important question to ask relevant to study quality is that asked by 
Glass (1976): Is there a relationship between quality of research and effect size 
obtained? If there is not, the inclusion of poorer quality studies will have no effect on 
the estimate of the average effect size though it will help to decrease the size of the 
confidence interval around that mean. If there is a relationship between the quality 
of research and effect size obtained, we can employ whatever weighting system we 
find reasonable (and that we can persuade our colleagues and critics also to find 
reasonable). 
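
Differential weighting can be sketched in a few lines. The version below is only an 
illustration under stated assumptions: correlations are averaged in Fisher z units, the 
quality ratings serve directly as weights, and the function name, effect sizes, and 
ratings are all hypothetical.

```python
import math

def weighted_mean_r(rs, weights):
    """Weighted mean correlation, averaged in Fisher z units and converted back to r."""
    zs = [math.atanh(r) for r in rs]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(z_bar)

rs = [0.40, 0.25, 0.10]      # hypothetical effect sizes from three studies
quality = [3, 2, 1]          # hypothetical quality ratings used as weights

print(round(weighted_mean_r(rs, [1, 1, 1]), 3))   # every study weighted equally
print(round(weighted_mean_r(rs, quality), 3))     # weighted by rated quality
print(round(weighted_mean_r(rs, [1, 1, 0]), 3))   # zero weight drops the third study
```

The last call shows dropping a study as the special case of zero weighting: it gives 
the same answer as leaving the third study out altogether.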

Problems of Independence 

Responses Within Studies 

The first of two criticisms relevant to problems of independence notes that 
several effect size estimates and several tests of significance may be generated by the 
same subjects within each study. This can be a very well-taken criticism under some 
conditions and the problem has been dealt with elsewhere in some detail (Rosenthal, 
1984, Chapter 2; Rosenthal & Rubin, 1986). 

Studies Within Sets of Studies 

Even when all studies yield only a single effect size estimate and level of 
significance, and even when all studies employ sampling units that do not also 
appear in other studies, there is a sense in which results may be nonindependent. 
That is, studies conducted in the same laboratory, or by the same research group, 
may be more similar to each other (in the sense of an intraclass correlation) than 
they are to studies conducted in other laboratories or by other research groups (Jung, 
1978; Rosenthal, 1966, 1969, 1979b). The conceptual and statistical implications of 
this problem are not yet well worked out. 

Exaggeration of Significance Levels 

Truncating Significance Levels 

It has been suggested that all p levels less than .01 (Z values greater than 2.33) 
be reported as .01 (Z = 2.33) because p's less than .01 are likely to be in error 
(Elashoff, 1978). This truncating of Zs cannot be recommended and will, in the long 
run, lead to serious errors of inference (Rosenthal & Rubin, 1978). If there is reason 
to suspect that a given p level < .01 is in error it should, of course, be corrected before 
employing it in the meta-analysis. It should not, however, be changed to p = .01 
simply because it is less than .01. 
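
The cost of truncation is easy to see in a small sketch. It assumes that Zs are 
combined by summing them and dividing by the square root of the number of studies; the 
function name and the four Z values are hypothetical.

```python
import math

def stouffer_z(zs):
    """Combined standard normal deviate: sum of the Zs over the square root of k."""
    return sum(zs) / math.sqrt(len(zs))

zs = [3.50, 2.90, 0.20, -0.40]              # hypothetical study results
truncated = [min(z, 2.33) for z in zs]      # the proposed cap at p = .01

print(round(stouffer_z(zs), 2), round(stouffer_z(truncated), 2))
```

Because the cap can only pull the larger Zs downward, the combined Z is biased toward 
zero whenever any study exceeds the cutoff.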

Too Many Studies 

It has been noted as a criticism of meta-analyses that, as the number of studies 
increases, there is a greater and greater probability of rejecting the null hypothesis 
(Mayo, 1978). When the null hypothesis is false and, therefore, ought to be rejected, 
it is indeed true that adding observations (either sampling units within studies or 
new studies) increases statistical power. However, it is hard to accept as a legitimate 
criticism of a procedure a characteristic that increases its accuracy and decreases its 
error rate, in this case, type II errors. 

A related feature of meta-analysis appears to be that it may, in general, lead to 
a decrease in type II errors even when the number of studies is modest. Empirical 
support for this is provided in a study conducted by Cooper and Rosenthal (1980). 
Procedures requiring the research reviewer to be more systematic and to use more of 
the information in the data seem to be associated with increases in power, i.e., 
decreases in type II errors. 

Some Benefits of Meta-Analysis 

From what has been said of the various criticisms of meta-analysis it will 
surprise no one to learn that I strongly support the increasing use of meta-analytic 
procedures. My reasons for that support go beyond the fact that the various 
criticisms of meta-analysis can be readily addressed. In the time that remains I 
want to note a number of special benefits of meta-analysis. Some of these benefits 
are well known, but some are not; indeed, some are almost secret benefits. 

Most Obvious Benefits 

Completeness. Meta-analytic consideration of a research domain is more 
complete and exhaustive though this does not mean that all studies found are 
weighted equally. Indeed, every study should be weighted from zero to any desired 
number. These weights, of course, must be defensible. (It will not do to weight all 
my results +1.00 and all my enemies' results 0.00.) 

Explicitness. The quantitative nature of the process of obtaining effect sizes, 
standard normal deviates, and weights, forces explicitness on the analyst. Vague 
terms like "no relationship," "some relationship," "strong relationship," and "very 
significant" are replaced by numerical values. 

Power. Empirical work has shown that meta-analytic procedures increase 
power and decrease type II errors (Cooper & Rosenthal, 1980). 

Less Obvious Benefits 

Moderator variables. These are more easily spotted and evaluated in a context 
of a quantitative research summary. This aids theory development and increases 
empirical richness. 

Cumulation problems. Meta-analytic procedures address, in part, the chronic 
complaint that social sciences cumulate so poorly compared to the physical sciences. 
It should be noted that recent historical and sociological investigations have sug- 
gested that the physical sciences may not be all that much better off than we are 
when it comes to successful replication (Collins, 1985; Hedges, 1987; Pool, 1988). For 
example, Collins (1985) has described the failures to replicate the construction of 
TEA-lasers despite the availability of detailed instructions for replication. 
Apparently TEA-lasers could be replicated dependably only when the replication 
instructions were accompanied by a scientist who had actually built a laser. 

Least Obvious Benefits 

Decrease in overemphasis on single studies. One not so obvious benefit that will 
accrue to us is the gradual decrease in the overemphasis on the results of a single 
study. There are good sociological grounds for our monomaniacal preoccupation 
with the results of a single study. Those grounds have to do with the reward system 
of science where recognition, promotion, reputation, and the like depend on the 
results of the single study, also known as the smallest unit of academic currency. 
The study is "good," "valuable," and above all, "publishable" when p ≤ .05. Our 
disciplines would be further ahead if we adopted a more cumulative view of science in 
which the impact of a study were evaluated less on the basis of p levels, and more on 
the basis of its own effect size and on the revised effect size and combined probability 
that resulted from the addition of the new study to any earlier studies investigating 
the same or a similar relationship. This, of course, amounts to a call for a more meta- 
analytic view of "doing science." 

B. F. Skinner has been eloquent in his comments on the overvaluation of the 
single study: "In my own thinking, I try to avoid the kind of fraudulent significance 
which comes with grandiose terms or profound 'principles.' But some psychologists 
seem to need to feel that every experiment they do demands a sweeping 
reorganization of psychology as a whole. It's not worth publishing unless it has some such 
significance. But research has its own values, and you don't need to cook up spurious 
reasons why it's important" (Skinner, 1983, p. 39). 

"The new intimacy." This new intimacy is between the reviewer and the data. 
We cannot do a meta-analysis by reading abstracts and discussion sections. We are 
forced to look at the numbers and, very often, compute the correct ones ourselves. 
Meta-analysis requires us to cumulate data, not conclusions. "Reading" a paper is 
quite a different matter when we need to compute an effect size and a fairly precise 
significance level, often from a results section that never heard of effect sizes, precise 
significance levels (or the APA publication manual)! 

The demise of the dichotomous significance testing decision. Far more than is 
good for us, social and behavioral scientists operate under a dichotomous null 
hypothesis decision procedure in which the evidence is interpreted as anti-null if p ≤ 
.05 and pro-null if p > .05. If our dissertation p is < .05 it means joy, a Ph.D., and a 
tenure-track position at a major university. If our p is > .05 it means ruin, despair, 
and our advisor's suddenly thinking of a new control condition that should be run. 
That attitude really must go. God loves the .06 nearly as much as the .05. Indeed, I 
have it on good authority that she views the strength of evidence for or against the 
null as a fairly continuous function of the magnitude of p. As a matter of fact, two .06 
results are much stronger evidence against the null than one .05 result; and 10 p's of 
.10 are stronger evidence against the null than 5 p's of .05. 
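
These comparisons can be checked with a short sketch. It assumes that the p values are 
one-tailed, come from independent studies, and all point in the same direction, and that 
they are combined by converting each to a standard normal deviate, summing, and dividing 
by the square root of the number of studies. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def combined_z(p_values):
    """Combine one-tailed p values by summing their Zs over the square root of k."""
    norm = NormalDist()
    zs = [norm.inv_cdf(1 - p) for p in p_values]
    return sum(zs) / math.sqrt(len(zs))

cases = [("one .05", [.05]), ("two .06", [.06, .06]),
         ("five .05", [.05] * 5), ("ten .10", [.10] * 10)]
for label, ps in cases:
    z = combined_z(ps)
    # Print the combined Z and its one-tailed p.
    print(label, round(z, 2), round(1 - NormalDist().cdf(z), 5))
```

Under these assumptions the two .06 results yield a larger combined Z than the single 
.05 result, and the ten results at .10 yield a larger combined Z than the five at .05.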

The overthrow of the omnibus test. It is common to find specific questions 
addressed by F tests with df > 1 in the numerator or by χ² tests with df > 1. For 
example, suppose the specific question is whether increased incentive level improves 
the productivity of work groups. We employ four levels of incentive so that our 
omnibus F test would have 3 df in the numerator or our omnibus χ² would be on at 
least 3 df. Common as these tests are, they reflect poorly on our teaching of data 
analytic procedures. The diffuse hypothesis tested by these omnibus tests usually 
tells us nothing of importance about our research question. The rule of thumb is 
unambiguous: Whenever we have tested a fixed effect with df > 1 for χ² or for the 
numerator of F, we have tested a question in which we are almost surely not 
interested. 
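
The incentive example can be made concrete with a sketch that sets the omnibus F beside 
a focused linear contrast. Everything numerical here is hypothetical: the four group 
means, the equal n of 10 per group, and the error mean square of 20 are invented for 
illustration, and the function name is my own.

```python
def omnibus_and_contrast(means, n, ms_error, weights):
    """Return (omnibus F on k - 1 df, contrast F on 1 df) for equal-n groups."""
    k = len(means)
    grand = sum(means) / k
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    f_omnibus = ms_between / ms_error
    contrast = sum(w * m for w, m in zip(weights, means))   # weights sum to zero
    ms_contrast = n * contrast ** 2 / sum(w ** 2 for w in weights)
    return f_omnibus, ms_contrast / ms_error

# Hypothetical productivity means at four increasing incentive levels.
means = [10.0, 11.0, 12.0, 13.0]
f_omni, f_lin = omnibus_and_contrast(means, n=10, ms_error=20.0,
                                     weights=[-3, -1, 1, 3])
print(round(f_omni, 2), round(f_lin, 2))
```

When the means rise in a perfectly linear fashion, as they do in this invented case, the 
1-df contrast F is three times the 3-df omnibus F, because the omnibus test spreads the 
same sum of squares over three degrees of freedom.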

The situation is even worse when there are several dependent variables as well 
as multiple df for the independent variable. The paradigm case here is canonical 
correlation and special cases are MANOVA, MANCOVA, multiple discriminant 
function, multiple path analysis, and complex multiple partial correlation. While all 
of these procedures have useful exploratory data analytic applications they are 
commonly used to test null hypotheses which are scientifically almost always of 
doubtful value. The effect size estimates they yield (e.g., the canonical correlation) 
are also almost always of doubtful value. 

This is not the place to go into detail, but one approach to the problem of 
analyzing canonical data structures is to reduce the set of dependent variables to 
some smaller number of composite variables using the 
principal-components-followed-by-unit-weighting approach. Each composite can then be analyzed serially. 

Meta-analytic questions are basically contrast questions. F tests with df > 1 in 
the numerator or χ²'s with df > 1 are useless in meta-analytic work. That leads to 
an additional scientific benefit: 

The increased recognition of contrast analysis. Meta-analytic questions require 
precise formulation of questions and contrasts are procedures for obtaining answers 
to such questions, often in an analysis of variance or table analysis context. 
Although most textbooks of statistics describe the logic and the machinery of 
contrast analyses, one still sees contrasts employed all too rarely. That is a real pity 
given the precision of thought and theory they encourage and (especially relevant to 
these times of publication pressure) given the boost in power conferred with the 
resulting increase in .05 asterisks (Rosenthal & Rosnow, 1985). 

A probable increase in the accurate understanding of interaction effects. 
Probably the universally most misinterpreted empirical results in psychology are 
the results of interaction effects. A recent survey of 191 research articles involving 
interactions found only two articles that showed the authors interpreting 
interactions in an unequivocally correct manner (i.e., by examining the residuals that 
define the interaction) (Rosnow & Rosenthal, 1989). The rest of the articles simply 
compared means of conditions with other means, a procedure that does not 
investigate interaction effects but rather the sum of main effects and interaction 
effects. 

Most standard textbooks of statistics for psychologists provide accurate 
mathematical definitions of interaction effects but then interpret not the residuals 
that define those interactions but the means of cells that are the sums of all main 
effects and all interactions. 
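
The residuals that define an interaction are simple to compute. The sketch below is an 
illustration only: the function name and the 2 x 2 table of cell means are hypothetical. 
It removes the grand mean, the row effects, and the column effects from each cell.

```python
def interaction_residuals(cell_means):
    """Remove the grand mean, row effects, and column effects from a table of cell means."""
    n_rows, n_cols = len(cell_means), len(cell_means[0])
    grand = sum(sum(row) for row in cell_means) / (n_rows * n_cols)
    row_eff = [sum(row) / n_cols - grand for row in cell_means]
    col_eff = [sum(row[j] for row in cell_means) / n_rows - grand
               for j in range(n_cols)]
    return [[cell_means[i][j] - grand - row_eff[i] - col_eff[j]
             for j in range(n_cols)]
            for i in range(n_rows)]

# Hypothetical cell means: one cell is elevated, the other three are equal.
means = [[8.0, 4.0],
         [4.0, 4.0]]
print(interaction_residuals(means))
```

Read from the cell means, the result looks like "one condition differs from the other 
three." The residuals tell a different story: the interaction itself is a symmetric 
crossover, with the remainder of the pattern carried by the two main effects.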

In addition, users of SPSS, SAS, BMDP, and virtually all other data-analytic 
software are poorly served in the matter of interactions since virtually no programs 
provide convenient tabular output giving the residuals defining interaction. The 
only exception to that of which I am aware is a little-known package called Data- 
Text developed by Arthur Couch and David Armor for which William Cochran and 
Donald Rubin provided the statistical consultation. 

Since many meta-analytic questions are by nature questions of interaction (for 
example, that opposite sex dyads will conduct standard transactions more slowly 
than will same sex dyads), we can be hopeful that increased use of meta-analytic 
procedures will bring with it increased sophistication about the meaning of 
interaction. 

Meta-analytic procedures are applicable beyond meta-analyses. Many of the 
techniques of contrast analyses among effect sizes, for example, can be used within a 
single study (Rosenthal & Rosnow, 1985). Computing a single effect size from 
correlated dependent variables, or comparing treatment effects on two or more 
dependent variables serve as illustrations (Rosenthal & Rubin, 1986). 

The decrease in the splendid detachment of the full professor. Meta-analytic 
work requires careful reading of research and moderate data analytic skills. We 
cannot send an undergraduate research assistant to the library with a stack of 5 X 8 
cards to bring us back "the results." With narrative reviews that seems often to have 
been done. With meta-analysis the reviewer must get involved with the actual data 
and that is all to the good. 

Conclusion 

I hope that this paper has provided some comfort to the afflicted in showing 
that many of the findings of our discipline are neither as small nor as unimportant 
from a practical point of view as we may have feared. Perhaps I hope, too, that there 
may have been some affliction of the comfortable in showing that in our views of 
replication and of the cumulation of the wisdom of our field there is much yet 
remaining to be done. 

Table 1

Effects of Aspirin on Heart Attacks Among 22,000 Physicians

                              Heart Attack   No Heart Attack     Total

I. Raw Counts
   Aspirin                             104            10,933    11,037
   Placebo                             189            10,845    11,034
   Total                               293            21,778    22,071

II. Percentages
   Aspirin                            0.94             99.06       100
   Placebo                            1.71             98.29       100
   Total                              1.33             98.67       100

III. Binomial Effect Size Display
   Aspirin                            48.3              51.7       100
   Placebo                            51.7              48.3       100
   Total                               100               100       200
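
The step from the raw counts to the Binomial Effect Size Display can be sketched as 
follows. The sketch assumes that the effect size r is the phi coefficient of the 2 x 2 
table of raw counts and that the display rates are 50% plus and minus half of 100r; the 
function names are hypothetical.

```python
import math

def phi_from_counts(a, b, c, d):
    """Phi coefficient (r) for a 2 x 2 table with rows (a, b) and (c, d)."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def besd(r):
    """Binomial Effect Size Display rates: 50% minus and plus half of 100r."""
    return 50 - 100 * r / 2, 50 + 100 * r / 2

# Raw counts from Table 1: rows are aspirin and placebo,
# columns are heart attack and no heart attack.
r = abs(phi_from_counts(104, 10933, 189, 10845))
print(round(r, 3), [round(x, 1) for x in besd(r)])
```

An r this small still corresponds to the 48.3% versus 51.7% split shown in Section III 
of the table.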




Table 2

Other Examples of Binomial Effect Size Displays

I. Vietnam Service and Alcohol Problems (r = .07)

                               Problem        No Problem     Total
   Vietnam Veteran                53.5              46.5       100
   Non-Vietnam Veteran            46.5              53.5       100
   Total                           100               100       200

II. AZT in the Treatment of AIDS (r = .23)

                                 Death          Survival     Total
   AZT                            38.5              61.5       100
   Placebo                        61.5              38.5       100
   Total                           100               100       200

III. Benefits of Psychotherapy (r = .32)

                          Less Benefit   Greater Benefit     Total
   Psychotherapy                    34                66       100
   Control                          66                34       100
   Total                           100               100       200



Table 3

Common Model of Successful Replication: Judgment is Dichotomous and Based on 
Significance Testing

                                        First Study
Second Study      p > .05*                          p < .05

p < .05†          A. Failure to Replicate           B. Successful Replication

p > .05           C. Failure to Establish Effect    D. Failure to Replicate

*By convention .05 but could be any other given level, e.g., .01.
†In the same tail as the results of the first study.

Table 4

Illustrative Results of an Experiment and Its Replication

                                             Investigator
                                   I. Smith    II. Jones    Combined

Treatment Mean                        .38         .36         .376
Control Mean                          .26         .24         .256
Difference                            .12         .12         .120
t                                    2.21        1.06        2.45
df                                     78          18          98
two-tail p                            .03         .30         .02
effect size d (a)                     .50         .50         .50
effect size r (b)                     .24         .24         .24
standard normal Z                    2.17 (c)    1.03 (c)    2.40

95% Confidence intervals
   Mean differences      From:        .01        -.12         .02
                         To:          .23         .36         .22
   Effect size r's       From:        .02        -.23         .04
                         To:          .44         .62         .42

(a) Obtained from $2t/\sqrt{df}$.
(b) Obtained from $\sqrt{t^2/(t^2 + df)}$.
(c) These significance levels differ at Z = .81, p = .42, from $(Z_1 - Z_2)/\sqrt{2}$.
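
The footnoted quantities of Table 4 can be recomputed in a few lines. The sketch assumes 
78 and 18 df for the Smith and Jones studies, the values at which the footnote formulas 
reproduce the tabled effect sizes; the function names are hypothetical.

```python
import math

def d_from_t(t, df):
    """Effect size d obtained from 2t / sqrt(df)."""
    return 2 * t / math.sqrt(df)

def r_from_t(t, df):
    """Effect size r obtained from sqrt(t^2 / (t^2 + df))."""
    return math.sqrt(t ** 2 / (t ** 2 + df))

def z_difference(z1, z2):
    """Z for the difference between two independent standard normal deviates."""
    return (z1 - z2) / math.sqrt(2)

for name, t, df in [("Smith", 2.21, 78), ("Jones", 1.06, 18)]:
    print(name, round(d_from_t(t, df), 2), round(r_from_t(t, df), 2))
print(round(z_difference(2.17, 1.03), 2))
```

The two studies differ sharply in significance level yet yield the same d and the same 
r, and the difference between their Zs is itself far from significant.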

Table 5

Comparison of Three Sets of Replications

                                           Replication Sets
                                 A                   B                   C
                         Study 1   Study 2   Study 1   Study 2   Study 1   Study 2

N                           96        15        98        27        12        32
p (two-tail)               .05       .05       .01       .18    .000001      .33
Z(p)                      1.96      1.96      2.58      1.34      4.89     -0.97
r                          .20       .50       .26       .26       .72      -.18
Z(r)                       .20       .55       .27       .27       .90      -.18
Cohen's q (Zr1 - Zr2)           .35                 .00                1.08
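
Cohen's q for each replication set can be recomputed from the two r's. The sketch 
assumes that q is the difference between the Fisher-transformed correlations, taken here 
in absolute value; the function name is hypothetical.

```python
import math

def cohens_q(r1, r2):
    """Difference between two correlations in Fisher z units."""
    return math.atanh(r1) - math.atanh(r2)

# The r's of Study 1 and Study 2 for replication sets A, B, and C of Table 5.
for label, r1, r2 in [("A", .20, .50), ("B", .26, .26), ("C", .72, -.18)]:
    print(label, round(abs(cohens_q(r1, r2)), 2))
```

The results agree with the tabled values to within rounding, since the table works from 
Fisher z values already rounded to two decimals.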

Table 6

Seven Degrees of Variability (S) of Effect Sizes (Zr) Around a Mean Effect Size of 0.00

                                      Degree of Variability
Replicate           Set 1    Set 2    Set 3    Set 4    Set 5    Set 6    Set 7

1                    .02      .10      .20      .40      .60      .80     1.00
2                    .01      .05      .10      .20      .30      .40      .50
3                    .00      .00      .00      .00      .00      .00      .00
4                   -.01     -.05     -.10     -.20     -.30     -.40     -.50
5                   -.02     -.10     -.20     -.40     -.60     -.80    -1.00

S                   .016     .079     .158     .316     .474     .632     .791
Range                .04      .20      .40      .80     1.20     1.60     2.00
Equal Steps of       .01      .05      .10      .20      .30      .40      .50

Table 7

Replication Robustness Coefficients for Four Levels of Mean Effect 
Size (Zr) and Seven Degrees of Variability of Effect Size (S)

                      Mean Effect Size (Zr)
S              .10       .30       .50       .70

.016          6.25     18.75     31.25     43.75
.079          1.27      3.80      6.33      8.86
.158          0.63      1.90      3.16      4.43
.316          0.32      0.95      1.58      2.22
.474          0.21      0.63      1.05      1.48
.632          0.16      0.47      0.79      1.11
.791          0.13      0.38      0.63      0.88
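
Judging from the tabled values, each coefficient appears to be the mean effect size 
divided by S, with S taken to three decimals. The sketch below rebuilds the table on 
that assumption, deriving each S from five replicates spaced in equal steps around zero. 
The function name is hypothetical.

```python
import statistics

def robustness_rows(steps, means):
    """For each step size, return (S, [mean / S for each mean effect size])."""
    rows = []
    for step in steps:
        replicates = [2 * step, step, 0.0, -step, -2 * step]
        s = round(statistics.stdev(replicates), 3)   # S to three decimals
        rows.append((s, [round(m / s, 2) for m in means]))
    return rows

steps = [.01, .05, .10, .20, .30, .40, .50]   # equal steps for the seven sets
means = [.10, .30, .50, .70]                  # four levels of mean effect size
for s, row in robustness_rows(steps, means):
    print(s, row)
```

Reading down any column shows how quickly a given mean effect size loses robustness as 
the replications scatter more widely around it.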

Table 8

Probabilities of Various Combinations of Rejecting the Null Hypothesis for the Two 
Studies of Table 4

                                               Study II: Jones
                                  Probability of          Probability of Not
                                  Rejecting False Null    Rejecting False Null
Study I: Smith                    (Power = .18)           (Type II Error Rate = .82)    Total

Probability of Not Rejecting
False Null (Type II Error
Rate = .40)                            .07                     .33                       .40

Probability of Rejecting
False Null (Power = .60)               .11                     .49                       .60

Total                                  .18                     .82                      1.00



Figure 1

The Replication Plane

[Figure not reproduced. Recoverable axis label: Results of First Study (r).]



